layer153.pkl
220---------- Welcome to Pure-FTPd [privsep] [TLS] ----------
220-You are user number 2 of 5 allowed.
220-Local time is now 10:18. Server port: 21.
220-This is a private system - No anonymous login
220-IPv6 connections are also welcome on this server.
220 You will be disconnected after 15 minutes of inactivity.
USER alBERT
331 User alBERT OK. Password required
PASS dBASE
230 OK. Current directory is /
CWD .
250 OK. Current directory is /
TYPE I
200 TYPE is now 8-bit binary
PASV
227 Entering Passive Mode (127,0,0,1,117,49)
RETR layer153.pkl
150-Accepted data connection
150 2304.2 kbytes to download
226-File successfully transferred
226 0.001 seconds (measured here), 1529.63 Mbytes per second
import pickle
# unpickle data from model.pkl
with open('model.pkl', 'rb') as f:
    clf = pickle.load(f)
# print clf
print(clf)
0.pkl is very large, 93MB
.pkl now
1.pkl to 403.pkl
#!/bin/bash
for i in {1..404..2}
do
tshark -r Capture.pcapng -Y usb -z follow,tcp,raw,$i > session_$i.pkl
done
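As a quick sanity check on the extraction (a minimal sketch; the glob pattern and the expectation of 202 odd-numbered streams are assumptions based on the loop above):
import glob
import numpy as np
# count the extracted data streams and confirm one of them deserializes to a numpy array
files = sorted(glob.glob("session_*.pkl"),
               key=lambda p: int(p.split("_")[1].split(".")[0]))
print(len(files))  # expect 202 files (streams 1, 3, ..., 403)
arr = np.load(open(files[0], "rb"), allow_pickle=True)
print(type(arr), arr.dtype, arr.shape)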
>>> res = ""
>>> for i in pks:
...     xd = max(i)
...     if xd > 0.5:
...         res += "1"
...     else:
...         res += "0"
0.pkl to 201.pkl: loaded the pickles and noticed that in each numpy array the numbers are either all close to 0 or all close to 1. Then if I use "0" to represent all close to 0, and "1" for all close to 1, I got a binary string
0001000000000100000100000000010000010000000001000001000000000100000100000000010000010000000001000001000000000100000100000000010000010000000001000001000000000100000100000000010000010000000001000001000010
But nothing seems related to the flag. I am pretty sure it's correct, but I did not see the flag inside the binary string.
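For reference, the kind of check behind that conclusion might look like this (a sketch, assuming res is the bit string built above; reading it as 8-bit ASCII does not give printable text):
# group the bit string (one bit per array) into bytes and try to decode it
chunks = [res[i:i+8] for i in range(0, len(res) - len(res) % 8, 8)]
print(bytes(int(c, 2) for c in chunks))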
import numpy as np
t=np.load('xxx',allow_pickle=True)
print(t)
allow_pickle=True
>>> for i in pks:
...     print(len(i))
...
23440896
393216
1536
768
768
589824
768
589824
768
589824
768
589824
768
768
768
2359296
3072
2359296
768
768
768
589824
768
589824
768
589824
768
589824
768
768
768
2359296
3072
...
589824
768
768
768
pks = []
for i in range(1, 404, 2):
    file = f"session_{i}.pkl"
    t = np.load(open(file, "rb"), allow_pickle=True)
    pks.append(t)
>>> pks[0]
array([ 3.0280282e-03, -1.7906362e-03, 5.7056175e-05, ...,
-1.7809691e-02, 3.6876060e-02, 1.3254955e-02], dtype=float32)
>>> pks[3]
array([0.99998367, 0.9996013 , 1.0005598 , 1.0016987 , 0.99919254,
1.0002677 , 1.0010219 , 1.0004221 , 0.9998995 , 1.0002527 ,
1.0002414 , 0.99942666, 1.0006638 , 0.99949586, 1.0005087 , ...
An example of ~0 and ~1 (edited)
pkl files (layer0.pkl to layer201.pkl) are attached in the above zip mlm.zip
pks is a 2D array of numpy arrays; each numpy array is the weights of a given layer (0 to 201)
I play CTFs ****, output is weekly.
sahuang — Today at 11:01 PM
Does it make sense to assume input is masked flag, output is the masked part(i.e. the text inside flag format?)
Aymen — Today at 11:02 PM
You're on the right track, try to read more about how MLM works and how you could use it to get the flag
pks, 3) feed it flag format Cyber...{***(MASKED)***}, output is probably the word they want
sahuang — Today at 11:20 PM
Some layers provided have array dimension much larger than 768, which is BERT dimension per layer, I guess that's something I should sort out?
Aymen — Today at 11:22 PM
Yes
sahuang
Any possible hint on Misc/MLM on layer dimension? There are a lot of layers with dimensions much larger than 768 (though multiples of it), which cannot be added to the default BERT model (but the hint said use all default configs)
Plus, default BERT has 12 layers only.
Aymen
Default bert layer dimensions are well known, you can reconstruct these from the given arrays
Think about how can u do it!
sahuang
Do you mean some sort of average on pooling layers technique?
e.g. take avg of a 2x2 and consider it as a weight value
Aymen
No you won't need that
Can I have a look at your code?
sahuang
I wrote some code to get all 202 arrays, each having a different size
23440896, 393216, 1536, 768, 768, 589824...
Then I loaded a default BERT model (following some online tutorial) and tried to add layers to it, but only the 768-dimension ones could be added, which is why I had that question
Aymen
Would be nice if you check number of layers of bert model and dimensions of each layer, this would definitely help u! (edited)
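Aymen's suggestion can be checked directly with the standard Hugging Face API (a minimal sketch; nothing here is challenge-specific):
from transformers import BertConfig, BertForMaskedLM
m = BertForMaskedLM(config=BertConfig())
print(m.config.num_hidden_layers, m.config.hidden_size, m.config.intermediate_size)  # 12 768 3072
print(len(list(m.parameters())))  # number of weight tensors -- this should line up with the 202 pkl files
for name, p in list(m.named_parameters())[:8]:
    print(tuple(p.shape), name)  # e.g. (30522, 768) bert.embeddings.word_embeddings.weight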
Guesslemonger — Today at 11:58 AM
hi, for mlm we have 202 layers whereas default BERT uses 12 layers, also dimension of many layers is over 768. Is the idea here to narrow down 202 layers to 12 layers first?
all with 768 dimension
Ouxs — Today at 12:15 PM
@Aymen
Guesslemonger — Today at 12:50 PM
author offline?
Guesslemonger — Today at 1:00 PM
ok so I researched a bit more, started with 0 knowledge of this. are these 202 pkl files for context? since default BERT won't know what to do with CyberErudites{[MASK]}
so we are trying to expand the vocabulary basically
Aymen — Today at 1:00 PM
Indeed BERT does have only 12 layers, but if you take a look at each layer you'll find that each one consists of query, key, value, dropout, ..
Guesslemonger — Today at 1:02 PM
is this the idea?
Aymen — Today at 1:03 PM
vocab has already been expanded
Guesslemonger — Today at 1:04 PM
umm so these files are some components of the layer which we can change
so that model identifies flag format
Aymen — Today at 1:04 PM
you're not asked to change anything, only reconstruction
Guesslemonger — Today at 1:05 PM
so we have a default BERT model, we reconstruct it using these 202 files?
Aymen — Today at 1:05 PM
you're on the right track!
>>> from transformers import pipeline
>>> model = pipeline('fill-mask', model='bert-base-uncased')
>>> pred = model("What is [MASK] name?")
>>> pred
[{'score': 0.5362833738327026, 'token': 2115, 'token_str': 'your', 'sequence': 'what is your name?'}, {'score': 0.260379433631897, 'token': 2014, 'token_str': 'her', 'sequence': 'what is her name?'}, {'score': 0.14665310084819794, 'token': 2010, 'token_str': 'his', 'sequence': 'what is his name?'}, {'score': 0.036417704075574875, 'token': 2026, 'token_str': 'my', 'sequence': 'what is my name?'}, {'score': 0.004835808649659157, 'token': 2049, 'token_str': 'its', 'sequence': 'what is its name?'}]
from transformers import BertTokenizer, BertModel
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained("bert-base-uncased")
print(model.num_parameters) (edited)
so basically every parameter of model is given as a separate pickle file
and need to merge them to create bert model
<bound method ModuleUtilsMixin.num_parameters of BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
first part of output is word_embeddings, 30522 * 768 = 23440896
and length of first pkl file is indeed that
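That check can be extended to every array (a sketch, assuming pks holds the arrays in the same order as model.parameters(), which is what the reconstruction below relies on):
from transformers import BertConfig, BertForMaskedLM
ref = BertForMaskedLM(config=BertConfig())
# every pkl length should equal the flattened size of the matching parameter
for arr, (name, p) in zip(pks, ref.named_parameters()):
    assert len(arr) == p.numel(), (name, len(arr), p.numel())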
from transformers import BertModel, BertConfig, BertTokenizer, BertForMaskedLM
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM(config=BertConfig())
print(model.parameters) (edited)
(word_embeddings): Embedding(30522, 768, padding_idx=0)
Do you know how to do the rest?
from transformers import BertModel, BertConfig, BertTokenizer, BertForMaskedLM
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM(config=BertConfig())
shapes = []
for j, param in enumerate(model.parameters()):
    if j == 0:
        print(param.data)
    shapes.append(param.shape)
for j, param in enumerate(model.parameters()):
    # update param to our weights
    # if 2d, need to reshape pks[j]
    if len(shapes[j]) == 2:
        param.data = torch.from_numpy(pks[j]).view(shapes[j])
    else:
        param.data = torch.from_numpy(pks[j])
for j, param in enumerate(model.parameters()):
    if j == 0:
        print(param.data)
    assert param.shape == shapes[j]
Use this code to get it (pks is the array of pkl data)
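An equivalent way to do the same assignment (a sketch, not the approach used above) is to assemble a state_dict from pks and load it in one call:
import torch
# build name -> tensor pairs in parameter order, reshaping the flat arrays
state = {name: torch.from_numpy(arr).view(p.shape)
         for (name, p), arr in zip(model.named_parameters(), pks)}
model.load_state_dict(state, strict=False)  # strict=False: buffers and tied weights keep their defaults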
from transformers import BertTokenizer, BertForMaskedLM
from torch.nn import functional as F
import torch
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM.from_pretrained('bert-base-uncased', return_dict = True)
text = "The capital of France, " + tokenizer.mask_token + ", contains the Eiffel Tower."
input = tokenizer.encode_plus(text, return_tensors = "pt")
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
output = model(**input)
logits = output.logits
softmax = F.softmax(logits, dim = -1)
mask_word = softmax[0, mask_index, :]
top_10 = torch.topk(mask_word, 10, dim = 1)[1][0]
for token in top_10:
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)
to use input
from https://towardsdatascience.com/how-to-use-bert-from-the-hugging-face-transformer-library-d373a22b0209
from torch.nn import functional as F
from transformers import BertModel, BertConfig, BertTokenizer, BertForMaskedLM
import torch
import pickle, numpy as np
pks = [] # store all the weights
for i in range(1, 404, 2):
    file = f"session_{i}.pkl"
    t = np.load(open(file, "rb"), allow_pickle=True)
    pks.append(t)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM(config=BertConfig())
shapes = []
for j, param in enumerate(model.parameters()):
    shapes.append(param.shape)
for j, param in enumerate(model.parameters()):
    # update param to our weights
    # if 2d, need to reshape pks[j]
    if len(shapes[j]) == 2:
        param.data = torch.from_numpy(pks[j]).view(shapes[j])
    else:
        param.data = torch.from_numpy(pks[j])
text = "CyberErudites{" + tokenizer.mask_token + "}"
input = tokenizer.encode_plus(text, return_tensors = "pt")
mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
output = model(**input)
logits = output.logits
softmax = F.softmax(logits, dim = -1)
mask_word = softmax[0, mask_index, :]
top_10 = torch.topk(mask_word, 10, dim = 1)[1][0]
for token in top_10:
    word = tokenizer.decode([token])
    new_sentence = text.replace(tokenizer.mask_token, word)
    print(new_sentence)
Full code
for j, param in enumerate(model.parameters()):
    # update param to our weights
    # if 2d, need to reshape pks[j]
    if len(shapes[j]) == 2:
        param.data = torch.from_numpy(pks[j]).view(shapes[j])
    else:
        param.data = torch.from_numpy(pks[j])
Everything other than comment is done by copilot
torch.from_numpy(pks[j]).view(shapes[j])
from torch.nn import functional as F
from transformers import BertModel, BertConfig, BertTokenizer, BertForMaskedLM
import torch
import pickle, numpy as np
pks = [] # store all the weights
for i in range(1, 404, 2):
    file = f"session_{i}.pkl"
    t = np.load(open(file, "rb"), allow_pickle=True)
    pks.append(t)
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForMaskedLM(config=BertConfig())
shapes = []
for j, param in enumerate(model.parameters()):
    shapes.append(param.shape)
for j, param in enumerate(model.parameters()):
    # update param to our weights
    # if 2d, need to reshape pks[j]
    if len(shapes[j]) == 2:
        param.data = torch.from_numpy(pks[j]).view(shapes[j])
    else:
        param.data = torch.from_numpy(pks[j])
flag = ''
while not flag.endswith('}'):
    text = flag + tokenizer.mask_token
    input = tokenizer.encode_plus(text, return_tensors = "pt")
    mask_index = torch.where(input["input_ids"][0] == tokenizer.mask_token_id)
    output = model(**input)
    logits = output.logits
    softmax = F.softmax(logits, dim = -1)
    mask_word = softmax[0, mask_index, :]
    top_10 = torch.topk(mask_word, 10, dim = 1)[1][0]
    word = tokenizer.decode([top_10[0]])
    new_sentence = text.replace(tokenizer.mask_token, word)
    flag = new_sentence.replace('##','')
print(flag) (edited)
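As a follow-up, the reconstructed model can also be wrapped in the fill-mask pipeline shown earlier (a sketch, assuming model and tokenizer are the rebuilt objects from the script above):
from transformers import pipeline
pipe = pipeline('fill-mask', model=model, tokenizer=tokenizer)
print(pipe("CyberErudites{" + tokenizer.mask_token + "}"))  # top predictions for the masked token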